A Novel Deep Learning Model for Accurate Pest Detection and Edge Computing Deployment

Simple Summary This research proposes a novel attention mechanism for the task of rice pest detection, aiming to address the issues of complex backgrounds and small size of pests. By dynamically adjusting attention weights, the model effectively focuses on small-scale pests, avoiding distractions from complex background information. Concurrently, we adopt a multi-scale feature fusion technique, successfully extracting rich and distinctive features, thereby further enhancing the model’s performance. Numerous experiments demonstrate superior performance of our model over advanced methods like YOLO, EfficientDet, RetinaDet, and MobileNet in pest detection tasks. Overall, through innovative attention mechanism and feature fusion techniques, our work effectively tackles the critical issues in pest detection, achieving excellent detection results. Abstract In this work, an attention-mechanism-enhanced method based on a single-stage object detection model was proposed and implemented for the problem of rice pest detection. A multi-scale feature fusion network was first constructed to improve the model’s predictive accuracy when dealing with pests of different scales. Attention mechanisms were then introduced to enable the model to focus more on the pest areas in the images, significantly enhancing the model’s performance. Additionally, a small knowledge distillation network was designed for edge computing scenarios, achieving a high inference speed while maintaining a high accuracy. Experimental verification on the IDADP dataset shows that the model outperforms current state-of-the-art object detection models in terms of precision, recall, accuracy, mAP, and FPS. Specifically, a mAP of 87.5% and an FPS value of 56 were achieved, significantly outperforming other comparative models. These results sufficiently demonstrate the effectiveness and superiority of the proposed method.


Introduction
With the continued growth of the global population, the pressure on the food supply is increasing. As one of the world's major food crops, the improvement of rice yield has significant implications for ensuring global food security. However, in the process of rice production, pest infestations are one of the main factors affecting yield and quality [1,2]. Pest occurrences are closely related to the formation of gut microbes within the host [3]. Parasites not only manipulate hosts but also effectively utilize them, significantly impacting plants [4]. Traditional pest control methods mainly rely on manual field inspection [5], followed by the selection of appropriate control strategies based on the pest species and density [6]. However, this method is inefficient, time-consuming, labor-intensive, and challenging to implement on a large scale [7].
In the field of rice pest detection research, despite the existence of many studies using attention mechanisms, our work differs from them in several key aspects. First, our attention mechanism is specially designed to address specific issues in pest detection tasks, namely, the complex backgrounds and small size of pests. Our model, by dynamically adjusting attention weights, can focus more effectively on small-sized pests, without being distracted by complex background information. Meanwhile, our attention mechanism operates at the feature level, which also helps to capture pest features at different scales. Second, we have not only theoretically designed this novel attention mechanism but also validated its effectiveness through a large number of experiments. These include ablation studies based on different features and comparisons with other advanced methods (such as YOLO, EfficientDet, RetinaDet, and MobileNet). The results of these experiments consistently demonstrate the superior performance of our model in pest detection tasks.
Another significant innovation of this study is the design of a small knowledge distillation network for edge computing scenarios. Due to the limited computing capabilities and storage space of edge computing devices, designing a lightweight model that maintains a high precision is crucial for practical applications. The small network distills knowledge from the attention-mechanism-enhanced model, significantly reducing the size and computational complexity of the model while maintaining a high accuracy. This makes it more suitable for deployment on edge computing devices.
A series of experiments were conducted to validate the performance of the proposed model. The IDADP dataset, which contains images of six types of rice pests with high resolutions and diverse backgrounds, was used for training and testing. It is ideal for testing the performance of the model in complex environments. Experimental results show that the proposed model outperforms existing models, such as RetinaDet [25], EfficientDet [26], YOLOv5 [27], YOLOv8 [28], FasterRCNN [29], and MaskRCNN [30], in terms of accuracy, recall, precision, mAP, and FPS. Furthermore, ablation experiments were conducted on different attention mechanisms and data augmentation strategies, further verifying the effectiveness of the proposed model and strategies. Notably, the small knowledge distillation network significantly outperforms the large network in inference speed with only a minor loss in accuracy, making the model very suitable for deployment on edge devices with limited computing capabilities. Finally, the main innovative points of this paper are as follows: 1.
We proposed a novel attention mechanism that specifically addresses the issues of the complex backgrounds and small size of pests.

2.
We used multi-scale feature fusion techniques to effectively extract richer, more distinguishable features, thereby enhancing the performance of the model.

3.
Through a large number of ablation experiments and comparisons with other advanced methods, we have validated the effectiveness and superiority of our model.
In summary, our work effectively solves key problems in pest detection by proposing a novel attention mechanism and using multi-scale feature fusion techniques, achieving a superior detection performance.

Related Work
In this section, a discussion will be presented regarding work pertinent to this paper, including deep learning models extensively employed for object detection tasks. The principles behind these models will be briefly introduced, supplemented by necessary mathematical formulae.

RetinaDet
RetinaDet is a single-stage object detection model based on Focal Loss, which realizes a high detection speed while maintaining accuracy. The pivotal innovation of RetinaDet lies in the introduction of Focal Loss, aiming to solve class imbalance problems. The definition of Focal Loss is as follows: Here, p t represents the probability predicted by the model and γ is a hyperparameter to control the degree of attention that the loss function pays to simple and hard samples.

EfficientDet
EfficientDet is a multi-scale feature fusion object detection model based on EfficientNet. Its innovation lies in the introduction of a new network structure, Compound Scaling, for simultaneously optimizing the network depth, width, and input resolution. EfficientDet introduces a new feature fusion module, BiFPN (Bidirectional Feature Pyramid Network), enabling the model to more effectively fuse feature information at different levels and thus enhancing model detection accuracy. The update formula for BiFPN is as follows: where F n i represents the ith layer feature in the nth iteration, w n j represents the weight of the jth layer feature in the nth iteration, and the weights are learned.

YOLOv5
YOLOv5 is a variant of the YOLO series of object detection models. It further optimizes YOLOv4, enhancing the detection speed and accuracy of the model. YOLOv5 makes some improvements in the network structure, such as introducing CIOU loss to replace the original GIoU loss, to more accurately measure the overlap between prediction boxes and actual boxes. The formula for CIOU loss is: where IOU represents the intersection over union of the predicted and actual boxes, d is the distance between the centers of the predicted and actual boxes, c is the diagonal length of the smallest enclosing rectangle containing the predicted and actual boxes, and ar and ap represent the aspect ratios of the actual and predicted boxes, respectively.

YOLOv8
YOLOv8 is the latest version of the YOLO series of object detection models. It further improves upon YOLOv7, enhancing the model detection accuracy. YOLOv8 makes a series of improvements to the network structure, such as introducing new attention modules and convolution modules to enhance the model's feature extraction capabilities and receptive field.

Faster R-CNN
Faster R-CNN is an improved version of the R-CNN series models. It introduces a Region Proposal Network (RPN) to the original R-CNN model to accelerate the generation of object candidate regions. The objective function of Faster R-CNN is as follows: Here, p i denotes the probability of the ith anchor being an object; t i denotes the coordinates of the ith anchor; p * i and t * i , respectively, represent the true label and coordinates of the ith anchor; and L cls , and L reg , respectively, represent the classification loss and regression loss.

Mask R-CNN
Mask R-CNN is an extension of Faster R-CNN. It introduces a parallel branch to Faster R-CNN for generating object segmentation masks, enabling Mask R-CNN to achieve pixel-level object segmentation while conducting object detection. The loss function of Mask R-CNN, in addition to the classification loss and regression loss of Faster R-CNN, includes a mask loss: where L mask represents the mask loss, used to measure the difference between the predicted mask and the actual mask. These comparative models used in the experiments each have their strengths and weaknesses. Without exception, they have all made significant contributions to the development of object detection tasks. In the following section, an attention mechanism enhancement based on a single-stage object detection model, a multi-scale feature fusion network construction, and a small network design for edge computing scenarios through knowledge distillation will be introduced.

Materials
This section elucidates the datasets employed in the study, highlighting their characteristics, along with the data augmentation strategy implemented.

IDADP Dataset Analysis
The IDADP (Insect Detection and Analysis in Digital Pictures) dataset is tailored for insect detection and analysis tasks. It encompasses images of six types of rice pests: Spodoptera litura, Chilo suppressalis, Leptocorisa chinensis, Cnaphalocrocis medinalis, Locusta migratoria manilensis, and Sogatella furcifera, as shown in Figure 1. Within the IDADP dataset, each image encapsulates one or more types of rice pests, with every pest instance signified by a bounding box and a category label. The bounding box delineates the pest's location in the image, while the category label indicates the type of pest. Each type of pest is represented in approximately equal quantities of images, resulting in a balanced distribution of categories in the dataset.
The images in the IDADP dataset are of varied resolutions, reaching up to 4 K. These high-resolution images provide rich detail, aiding the model in detecting and categorizing pests more accurately. However, they also demand a higher computational power and processing speed from the model.

Data Augmentation
A suite of data augmentation techniques, including Cutout, Cutmix, and Mosaic [31], were utilized in this study to enhance the model's generalization capability and robustness, as shown in Figure 2.

Cutout
Cutout is a data augmentation strategy that simulates occlusions in images by randomly selecting a region and setting its pixel values to 0, as shown in Figure 2A. The process can be represented by the following equation: Here, X is the original image and M is a binary mask indicating which pixel locations should be set to 0, as shown in Figure 3.

Cutmix
Cutmix is a data augmentation strategy that blends two images, as shown in Figure 2C. It randomly selects a region in the first image and replaces it with the corresponding region from a second image. The process can be represented by the following equation: Here, X 1 and X 2 are the two original images and M is a binary mask indicating the pixel locations originating from X 1 and X 2 , respectively.

Mosaic
Mosaic is a data augmentation strategy that concatenates four images, as shown in Figure 2B. The images are first scaled to the same size then stitched together in a certain order.

Data Augmentation Using GAN Models
In addition, Generative Adversarial Networks (GANs) were utilized for data augmentation. During the process, a GAN model is trained using the original data, and then this model is used to generate new image data. The GAN model consists of a generator G and a discriminator D, where the generator G attempts to produce fake images indistinguishable from real ones, while the discriminator's D task is to distinguish whether the input image is real or generated by the generator. The training process can be expressed by the following equation: Here, x represents samples from real data, z is the random noise input to the generator, D(x) is the discriminator's judgment on real images, and D(G(z)) is the discriminator's judgment on generated images. When generating new image data, noise z is first sampled from a preset random distribution and then input to the trained generator to produce new images. This process can be represented as follows: Through the above data augmentation strategies and data expansion using GAN models, the diversity of training samples can be significantly increased, thereby enhancing the model's generalization ability and robustness, which in turn yields improved results in practical applications. The proposed methods will be introduced in the following section.

Proposed Method
This section details the method proposed, which is based on a single-stage object detection model. The method primarily comprises three innovative aspects: attention mechanism enhancement based on the single-stage object detection model, the construction of a multi-scale feature fusion network, and designing a small knowledge distillation network tailored for edge computing scenarios. The overview of the proposed method flow is shown in Figure 4.

Attention Mechanism Enhancement Based on Single-Stage Object Detection Model
In the proposed method, an attention mechanism is introduced into the single-stage object detection model, enhancing the model's focus on targets and thereby improving the precision of object detection. Specifically, an attention module is added to every layer of the feature extraction network. This module can generate an attention map to guide the model to focus more on areas where the target is located.
The specific operation of the attention module can be expressed by the following formula: Here, F represents the input feature map, W and b are parameters of the attention module, σ is the sigmoid activation function, and A is the generated attention map. After obtaining the attention map, it is multiplied with the original feature map to obtain the enhanced feature map: Here, ⊗ represents element-wise multiplication. This approach allows the model to focus more on the area where the target is located, thereby improving the precision of object detection.

Significance of Multi-Scale Feature Fusion in Pest Identification
The application of the multi-scale feature fusion technique in this study aims to effectively integrate features from different layers of the network, extracting richer and more distinguishable features to enhance the model performance. This study opted for convolutional neural networks (CNNs) as the baseline model. CNNs possess the ability to extract local features from images. As the network deepens, its feature extraction ability gradually transitions from basic edges and textures to more advanced shapes and parts, as demonstrated in Figure 5. Therefore, features from different layers can be considered features of different scales. For the task of pest identification, lower-level (shallow) features may include edges, colors, and textures, which play a crucial role in identifying the species and physiological states of pests (such as larvae or adults). In contrast, higher-level (deep) features can extract the overall shape, size, and other global information about the pest, aiding in distinguishing different pest species. Therefore, effectively integrating these features of different layers and scales allows the model to obtain richer and more comprehensive pest feature information. In our task, multi-scale feature fusion played a significant role in two aspects:

1.
Pest morphological features exhibit different characteristics on different scales. For instance, at the macro level, we can observe the overall shape, color, and texture of the pest, which are crucial for distinguishing different pest species. On the micro level, we can observe some details of the pest's body, such as the shape and texture of scales, antennae, wings, etc. These features assist us in more accurately identifying pests.

2.
Pests exhibit significant variations in their morphological features at different stages of growth. For example, the shape, size, and color of a pest's larvae and adult form may differ completely. This necessitates our model's capability to adapt to such changes and capture the features of pests at different stages. Through multi-scale feature fusion, our model can acquire pest features at different scales, thereby better adapting to the morphological changes of pests and improving identification accuracy.

Implementation of Multi-Scale Feature Fusion
As discussed above, a multi-scale feature fusion network is built to extract and fuse features of different scales. Specifically, convolution operations with kernels of different scales are first performed on the input image, generating a series of feature maps of different scales. These feature maps are then fused through a series of upsampling and downsampling operations to obtain the final feature map. In this task, a multi-scale feature fusion network (MSFFN) is utilized to fully use the information of the target on different scales, thereby improving the performance of the model: 1.
Network structure: The structure of the MSFFN mainly includes a backbone network and multiple feature fusion modules.

2.
Backbone network: EfficientNet is chosen as the backbone network, which can provide rich and multi-scale feature maps. EfficientNet achieves a high performance while maintaining a low complexity through balanced expansion in network depth, width, and resolution. In this task, EfficientNet-B0 is chosen as the backbone network. 3.
Feature fusion module: The feature fusion module mainly includes convolution layers, upsampling layers, and downsampling layers. These layers are used for the fusion and processing of feature maps of different scales. Specifically, convolution layers are first used to extract local information from feature maps, then the scale of the feature maps is adjusted to be consistent through upsampling and downsampling operations, and, finally, these feature maps are fused. The fusion process of features can be represented by the following mathematical formula: Here, F i is the feature map of the ith layer; U p and Down represent upsampling and downsampling operations, respectively; Conv represents the convolution operation; and ⊕ represents the fusion of feature maps, which can be an addition or concatenation operation. 4.
Channel number of the feature map: In the MSFFN, the number of channels in the feature map is predominantly adjusted through the convolution layer. To be specific, a convolution layer is implemented in each feature fusion module to adjust the number of channels in the feature map. This adjustment enables the maintenance of a consistent number of channels whilst fusing the feature maps.
Through such a design, the MSFFN can fully utilize the object's information at different scales, thereby enhancing the model's performance.

Design of a Small Network for Edge Computing Scenarios Using Knowledge Distillation
In edge computing scenarios, due to hardware resource limitations, there is a need to design a lightweight model for object detection. Therefore, a method of knowledge distillation is introduced, allowing the large model to transfer knowledge to the small model. This transfer facilitates the small model in maintaining a high accuracy while meeting the requirements of edge computing. The knowledge distillation process consists of two stages: initially, a large model (also known as the teacher model) is trained, followed by the training of a small model (also known as the student model). During the training process, the student model learns not only the label information of the data but also the prediction results of the teacher model. The loss function of knowledge distillation can be expressed by the following formula: Here, L CE represents the cross-entropy loss of the student model, L KD denotes the loss of knowledge distillation, and α is a balance coefficient. L KD can be calculated using the following formula: Here, S is the prediction result of the student model, T is the prediction result of the teacher model, and KL is the Kullback-Leibler divergence used to measure the similarity between two distributions.
Through this method, the small model can learn the knowledge of the large model, thereby achieving efficient object detection in edge computing scenarios. In this task, the process of knowledge distillation primarily consists of two steps: first, training a large network (teacher network) and then using the output of this network to guide the training of the small network (student network), as shown in Figure 6. The training processes are as follows: 1.
Training of the teacher network: The teacher network is typically a large, deep network, such as the single-stage object detection model enhanced by the attention mechanism in this study. This network can have more parameters and a deeper network structure during training, thus acquiring more features and information. Then, the IDADP dataset is used to train this large network to optimize its performance in the object detection task.

2.
Training of the student network: The student network is typically a smaller, shallow network. Its purpose is to maintain a high performance while reducing the computation and storage requirements. When training the student network, not only is the standard loss function (such as cross-entropy loss) used but also the output of the teacher network, which is used as a "soft label" to guide the training of the student network. Specifically, the KL divergence between the output of the teacher network and the student network is calculated as an additional loss to force the student network to mimic the behavior of the teacher network. This additional loss function can be expressed as follows: Here, L CE denotes the cross-entropy loss; z s and z t represent the logits of the student network and the teacher network, respectively; T is a temperature parameter; and α is a weight parameter used to balance the cross-entropy loss and the KL divergence loss.
In this way, the student network not only learns the true labels of the data but also learns the behavior of the teacher network, thereby achieving the goal of maintaining a high detection performance while reducing the model size.

Experiment Settings
For the experiments, a server equipped with an Nvidia Tesla V100 GPU and an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20 GHz was used for training and testing. The server runs on the Ubuntu 18.04.3 LTS operating system. From the software perspective, Python 3.7.10 was used as the language for code development and execution, PyTorch 1.8.1 was the primary deep learning framework, OpenCV 4.5.2 was used for image processing, and other libraries, such as numpy 1.20.1 and scikit-learn 0.24.1, were used for data processing and model evaluation. All the code was developed and run in this hardware and software environment.
The model training employed the Adam optimizer with an initial learning rate set at 0.001. A learning rate decay strategy was implemented, reducing the learning rate to 10% of the original rate every 20 epochs. The model was trained for a total of 100 epochs. The batch size was set to 32, determined by the GPU memory capacity of the hardware platform. Models in comparative experiments, including RetinaDet, EfficientDet, YOLOv5, YOLOv8, FasterRCNN, and MaskRCNN, were all of the latest versions and were set with respective default parameters according to their official documentation. All models were trained and tested under the same environment and datasets to ensure experimental fairness.

Experiment Metric
The experiments employed performance metrics, such as precision, recall, accuracy, mean average precision (mAP), and frames per second (FPS), to evaluate the performance of our model and other comparative models:

1.
Precision: Precision refers to the ratio of true positives in the detected positives. Its formula is precision = TP TP+FP , where TP represents true positives, i.e., the number of targets correctly detected by the model, and FP represents false positives, i.e., the number of targets incorrectly detected by the model.

2.
Recall: Recall refers to the proportion of actual positives detected. Its formula is recall = TP TP+FN , where FN represents false negatives, i.e., the number of actual targets not detected by the model.

3.
Accuracy: Accuracy refers to the proportion of all samples (positive and negative) correctly classified. Its formula is accuracy = TP+TN TP+FP+TN+FN , where TN represents true negatives, i.e., the number of non-targets correctly judged by the model.

4.
Mean average precision (mAP): mAP is the average precision across all classes, an indicator that takes into account both precision and recall. In object detection, the AP of each class is obtained by calculating the area under the precision-recall curve.

5.
Frames per second (FPS): FPS is a critical indicator of model speed, denoting the number of frames the model can process per second. A higher value for the FPS implies a faster detection speed of the model.
These metrics were chosen as they evaluate the model's performance from various perspectives, including the model's precision, recall ability, overall performance, and running speed, etc. They allow a comprehensive understanding of the performance of our model and other comparative models in the task.

Pest Detection Results
In this section, a detailed evaluation of the proposed model and several state-ofthe-art object detection models, including RetinaDet, EfficientDet, YOLOv5, YOLOv8, FasterRCNN, and MaskRCNN, will be reported. All models were trained and tested on the IDADP dataset, and evaluation metrics such as precision, recall, accuracy, mAP, and FPS were used. The experimental results are shown in Table 1. From the table, it can be observed that all models perform well in terms of precision, recall, accuracy, and mAP, but there are significant differences in frames per second (FPS), indicating variations in processing speed among the models. RetinaDet and EfficientDet exhibit similar accuracies, but EfficientDet achieves a slightly higher mAP, likely due to its more complex model structure and more effective feature extraction capabilities. However, both models have lower FPS scores, which can be attributed to their complex structures that require longer computation times. YOLOv5 and YOLOv8 outperform RetinaDet and EfficientDet in all evaluation metrics, possibly due to the lightweight design of the YOLO series models, which strike a better balance between detection accuracy and speed. Particularly, YOLOv8 performs close to our model in terms of mAP, although with a slightly lower FPS. FasterRCNN and MaskRCNN show relatively poorer performances. While they exhibit comparable precision and recall, their mAP and FPS are lower. This may be attributed to the complexity of these models, which demand significant computational resources and result in slower processing speeds on our hardware.
The proposed model outperforms all the compared models in all evaluation metrics, as shown in Figure 7, likely owing to our three innovations: attention mechanism enhancement based on a single-stage object detection model, multi-scale feature fusion network construction, and the design of a small knowledge distillation tailored for edge computing scenarios. By introducing the attention mechanism, our model can more accurately focus on regions containing the target, thereby improving detection accuracy. Our multi-scale feature fusion network better utilizes features from different scales, further enhancing the detection performance. Through knowledge distillation, we design a small network that significantly improves the detection speed while ensuring detection accuracy, resulting in a superior FPS performance compared to all the comparative models.

Ablation Study on Attention Mechanism
In this section, we present some ablation experiments to further validate the effectiveness of our proposed methods. These experiments include ablation experiments on different attention mechanisms, performance tests on different combinations of data augmentation, and inference speed and accuracy tests on the small knowledge distillation network.
Firstly, we conducted ablation experiments on different attention mechanisms. We compared the performance of models without an attention mechanism, with an SE atten-tion mechanism, with a CBAM attention mechanism, and with our proposed attention mechanism. The experimental results are shown in Table 2. From the ablation experiments on different attention mechanisms, we can observe that both the SE and CBAM attention mechanisms perform better than the model without an attention mechanism. This indicates that attention mechanisms can indeed improve the model's focus, thereby enhancing the detection performance. Moreover, our proposed attention mechanism outperforms the SE and CBAM mechanisms in terms of performance, indicating that our attention mechanism better utilizes contextual information and effectively focuses on the target.

Ablation Study on Data Augmentation
Next, we performed performance tests on different combinations of data augmentation methods. We compared the performance of models without data augmentation, with Cutout, with Cutmix, with Mosaic, and with GAN-based data augmentation. The experimental results are shown in Table 3. From the performance tests on different combinations of data augmentation, we can see that all data augmentation methods (Cutout, Cutmix, Mosaic, and GAN-based augmentation) outperform the model without data augmentation. This indicates that data augmentation can effectively increase the model's robustness and improve detection performance. Among the tested methods, GAN-based data augmentation achieves the best performance, likely because GAN can generate more diverse data, further enhancing the model's robustness.

Ablation Study on Multi-Scale Feature Fusion
In designing the ablation study on feature multi-scale fusion, we separated the features being fused according to the level or scale and conducted experiments independently. We observed the individual effects of each scale's features, as well as the effect after their fusion. Through such an ablation study, we can understand the impact of each layer's features and multi-scale feature fusion on model performance, for example, the performance of shallow features, mid-level features, and deep features individually, as well as their performance when combined in pairs and when fused all together. This will help us understand the contribution of multi-scale feature fusion to model performance enhancement and guide us to further optimize model design and training strategies. The experimental results are shown in Table 4. According to the results of the ablation study in Table 4, we can clearly see the importance of feature multi-scale fusion in improving the performance of the model. First, it is evident that, whether using shallow features, mid-level features, or deep features, the performance of a model using any single type of features cannot match the performance of a model using all the features. This indicates that, in the pest recognition task, shallow, mid-level, and deep features all have their unique importance and are indispensable. However, from the experiment results of using shallow, mid-level, and deep features separately, the performance of the deep features model is the best, followed by the midlevel features, and the shallow features perform the worst. This suggests that, in this task, deep features (such as the overall shape, size, and other global information of pests) are crucial for pest recognition. However, this does not mean that shallow features and midlevel features are not important. As we can see, the performance of the model improves when we combine shallow features or mid-level features with deep features. This indicates that shallow features and mid-level features can provide some information that deep features cannot obtain, such as edges, color, and texture. Finally, we see that we obtain the best model performance when we combine shallow, mid-level, and deep features. This further verifies the importance of feature multi-scale fusion, i.e., combining features of multiple scales can make the model obtain more abundant and provide comprehensive information, thereby improving the performance of the model. In summary, this ablation study clearly demonstrates the important role of feature multi-scale fusion in the pest recognition task, verifying the effectiveness of our approach.

Ablation Study on Knowledge Distillation
Finally, we conducted inference speed and accuracy tests on the small knowledge distillation network. We compared the performance and inference speed of the original model with the small distilled network. The experimental results are shown in Table 5. The aforementioned experiment aimed to transfer knowledge from a large deep learning model to a smaller one. The data in Table 5 clearly reveal the outcome of this process. As we can see, the parameter count of the original model was 86.0 M, but the parameter count of the small model after knowledge distillation is only 2.28 M, significantly reducing the model's size. Similarly, the computational complexity was reduced from 35.1 G to 0.19 G. However, at the same time, metrics such as the precision, recall, and mAP have only seen slight declines. This indicates that, despite a significant reduction in the complexity of the small model, its performance on the pest detection task remains robust. The experimental results can be explained from the following perspectives:

1.
Soft labels: In the process of knowledge distillation, the teacher model provides a probability distribution for each class, known as soft labels, instead of single hard labels. Compared to hard labels, soft labels provide more detailed information, helping the student model to learn finer class distinction information.

2.
Model capacity: Although the smaller model has fewer parameters, it does not mean it cannot achieve a good performance. In fact, if the amount of data are limited, overly complex models can cause overfitting and therefore cannot achieve a good generalization ability. Through knowledge distillation, we can find a balance point, ensuring the model is not overly complex or overly simplified and thus obtaining an optimal performance. 3.
Attention transfer: In some cases, the teacher model may over-attend to unnecessary information and overlook features that are more important to the target task. Through knowledge distillation, the student model can learn these important features from the teacher model, thereby improving its performance.
In summary, the key to knowledge distillation is to leverage the knowledge of the teacher model to aid the student model in learning. This ensures that a high model performance can still be maintained even with a significant reduction in the model parameters and computational complexity. Finally, we can observe that, although the performance of the distilled small model is slightly lower than the original model, its inference speed is significantly improved. This indicates that, through knowledge distillation, we can design a model that maintains a relatively high detection accuracy while achieving a higher inference speed, which is of practical value for real-time pest detection tasks.
Compared with MobileNet, these ablation experiment results demonstrate that our proposed attention mechanism, data augmentation strategies, and knowledge distillation method contribute significantly to the performance improvement of the model. In conclusion, our proposed methods are effective, enabling the model to achieve a high detection accuracy while maintaining a high detection speed.

Exploration of Attention Focus Visualization
To further investigate the impact of the attention mechanism on the performance improvement of our model, we conducted visualizations of the attention maps. By visualizing the attention maps, we can intuitively observe the distribution of the model's attention on different regions of the input image, understand how the model works, and further demonstrate the effectiveness of our proposed model.
Firstly, it is necessary to understand the role of the attention mechanism in the model. The essence of the attention mechanism is a weight allocation strategy, where different parts of the input are assigned different weights based on their importance. In visual tasks, the attention mechanism often manifests as the model's focus on certain regions of the image. Specifically, if the model considers a particular region crucial for the task, that region will have a higher weight and the model will pay more attention to it.
In our task, we employ attention mechanism enhancement based on a single-stage object detection model. This attention mechanism automatically identifies the most important regions in the image for the task, namely, the locations of pests, assisting the model in accurately detecting these pests. During the visualization process, we present the attention distribution of the model on the image by creating heatmaps. The heatmap is a two-dimensional data visualization method that represents data magnitude through variations in color intensity. In our task, the color intensity in the heatmap represents the model's attention level on different regions of the image. The darker the color, the higher the model's attention and vice versa, as shown in Figure 8.
From our heatmaps, we can clearly see that the model's attention is mainly focused on the regions containing pests. This aligns with our expectations because pests are the most important targets in our task, and the model should primarily focus on these regions. This result indicates that our model effectively utilizes the attention mechanism to automatically identify important regions in the image, thereby improving the detection accuracy. Additionally, we notice that the model also has some level of attention on non-pest regions. This may be due to the presence of features in these regions that resemble pests or due to a data imbalance. Despite this, overall, the attention distribution of the model still aligns with our expectations.
In summary, through the visualization of the attention mechanism, we can gain insight into the model's focus and observe that the model's attention is mainly concentrated on the pest regions in the image. This result verifies the effectiveness of our attention mechanism based on a single-stage object detection model.

Conclusions
Despite the outstanding performance demonstrated in the task of rice pest detection, there are certain limitations recognized and areas identified for further research and improvement.
The proposed model has shown superior performance on the IDADP dataset compared to current state-of-the-art object detection models across various evaluation metrics, such as precision, recall, accuracy, mAP, and FPS. More specifically, a mAP of 87.5% and an FPS value of 56 were achieved, significantly outperforming other comparative models. However, the model was primarily focused on the pest areas of the images, as the primary task was pest detection. Practical applications may require the detection of other objects, such as leaf blight or the growth state of the rice. These targets may have different features in the images than pests; hence, adjustments to the model may be necessary to cater to these new tasks. Future work will explore how the model can be extended to more object detection tasks.
It was also noticed that the model showed some degree of attention to non-pest areas, possibly due to these areas containing features similar to pests or caused by a dataset imbalance. This issue could be mitigated by data augmentation methods to increase the dataset diversity and further optimize the model performance.
Although attention mechanisms based on single-stage object detection models were used, there is room for improvement. For instance, the introduction of more complex attention mechanisms, such as self-attention mechanisms, could further enhance the model performance. Moreover, the combination of attention mechanisms with other machine learning technologies, such as deep learning or reinforcement learning, could be explored to enhance the model's performance. Finally, the model was trained and tested on a specific hardware platform and software environment. However, practical applications may require the model to be used in different hardware platforms and software environments. Thus, optimization of the model for it to maintain a good performance in various environments is necessary.
In future work, the aforementioned issues will be thoroughly researched and improved upon. It is hoped that, through continuous optimization and improvement, the model will play a greater role in more scenarios and tasks.